Phylogenetic analysis based on single-copy orthologous proteins in highly variable chloroplast genomes of Corydalis

Corydalis is one of the few lineages reported to have extensive large-scale chloroplast genome (cp-genome) rearrangements. In this study, novel cp-genome rearrangements of Corydalis pinnata, C. mucronata, and C. sheareri are described. C. pinnata is a narrow endemic species distributed only at Qingcheng Mountain in southwest China. Two independent relocations of the same four genes (trnM-CAU-rbcL) were found, from the typically posterior part of the large single-copy region to the front of it. A uniform inversion of an 11–14-kb segment (ndhB-trnR-ACG) was found in the inverted repeat region, and losses of the accD, clpP, and trnV-UAC genes were detected in the cp-genomes of all three species. In addition, a phylogenetic tree was reconstructed based on 31 single-copy orthologous proteins in 27 cp-genomes. This study provides insights into the evolution of cp-genomes throughout the genus Corydalis and also provides a reference for further studies on the taxonomy, identification, phylogeny, and genetic transformation of other lineages with extensive rearrangements in cp-genomes.


Phylogenetic analyses.
Using concatenated single-copy orthologous proteins to resolve phylogenetic relationships can avoid tree reconstructions that are misled by rearrangements and provides a more reliable evolutionary framework than using several specific genes 18. Therefore, the predicted proteome was used in the phylogenetic analyses rather than the whole cp-genome sequence. Based on 31 single-copy orthologous proteins conserved in 27 species, with E. pleiosperma as the outgroup, a maximum-likelihood (ML) phylogenetic tree was reconstructed to illuminate the evolutionary history of the compared species (Fig. 6). The ML tree had three major clades: the Fumarioideae clade, the Papaveroideae clade, and a clade containing the remaining members of Ranunculales. Corydalis constituted a monophyletic sub-clade nested within the Fumarioideae clade. All lineages within Corydalis were strongly supported. The three newly sequenced Corydalis species, namely, C. pinnata (Sect. Mucronatae), C. mucronata (Sect. Mucronatae), and C. sheareri (Sect. Asterostigmata), were closely related.
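The concatenation step mentioned above can be illustrated with a minimal Python sketch. It assumes that each protein has already been aligned separately and that every alignment contains exactly one sequence per taxon; the function name, the dictionary layout, and the toy taxon labels are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, List


def concatenate_alignments(
    alignments: List[Dict[str, str]], taxa: List[str]
) -> Dict[str, str]:
    """Join per-protein alignments into one supermatrix row per taxon.

    Each alignment maps a taxon name to its aligned (gapped) sequence.
    This layout is an assumption made for illustration only.
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        # All rows of one alignment must have the same number of columns.
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences in one alignment must share a length")
        # A single-copy ortholog set should cover every taxon exactly once.
        missing = set(taxa) - set(aln)
        if missing:
            raise ValueError(f"alignment lacks taxa: {sorted(missing)}")
        for taxon in taxa:
            parts[taxon].append(aln[taxon])
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example with two short "proteins" and two hypothetical taxa.
toy = [
    {"taxon_A": "MK-L", "taxon_B": "MKAL"},
    {"taxon_A": "GT", "taxon_B": "G-"},
]
supermatrix = concatenate_alignments(toy, ["taxon_A", "taxon_B"])
```

Because each protein is aligned on its own before joining, the order of genes along the chromosome never enters the matrix, which is why this kind of input is insensitive to relocations and inversions.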

Discussion
Although the three newly sequenced Corydalis cp-genomes from the same geographic region belong to two different subgenera of Corydalis, the sizes and structures of their LSC, IR, and SSC regions, as well as their total genomes, are highly similar. This includes similar gene losses, inversions, and relocations (Fig. 1 and Supplementary Table 1), which are common features in the Corydalis cp-genomes and are considered to be responsible for the variation in cp-genome sizes 1. The loss of three genes (accD, clpP, and trnV-UAC) is a synapomorphic characteristic in the Corydalis cp-genomes (Supplementary Table 1). Xu et al. 1 speculated that the loss of the accD gene occurred before divergence of the genus Corydalis. However, in the present study, the accD gene was found in the cp-genomes of a few species of the subgenus Rapiferae (Supplementary Table 1), which indicates that the loss event happened after divergence of the genus Corydalis. The exact time of the loss event should be further explored by gathering more information on Corydalis cp-genomes. The accD gene has been relocated to the nucleus in some species, such as some members of the family Campanulaceae 19,20. The pseudogenization or loss of 11 chloroplast ndh genes that encode NADH dehydrogenase subunits occurred in only a few species of the genus Corydalis (C. conspersa, C. davidii, C. adunca, and C. inopinata; Supplementary Table 1). Strikingly, these species are all located in high-altitude areas (1000–5200 m a.s.l.) 21. Therefore, extreme environmental conditions may result in gene deletions or pseudogenization; this phenomenon has been observed in other species 22. Further studies are required to determine whether the pseudogenization or loss of ndh genes affects photosynthesis in those plants.
The chloroplast genome, as the genome of a photosynthetic organelle, is highly conserved in terms of structure, gene content, and arrangement 23–25. Large-scale rearrangement exists only occasionally in a few lineages, such as Campanulaceae 16 1-16, Fig. 2) of Corydalis plants, which determine the diversity in Corydalis cp-genomes. Repeat sequences may contribute to structural variations in relatively stable rearrangement regions 58–60. Relocation occurred only in the LSC region of the Corydalis cp-genomes, and inversion occurred only in the IR and SSC regions (Fig. 2). This suggests that the patterns of relocation and inversion are regulated in different ways. In addition, blocks 1-16 are likely active rearrangement regions because they show various rearrangement patterns. C. hsiaowutaishanensis (subg. Corydalis), C. adunca (subg. Cremnocapnos), C. saxicola, and C. fangshanensis (subg. Sophorocapnos) all underwent the inversion of blocks 10-16, but the inversion boundaries of C. hsiaowutaishanensis expanded into block 9, suggesting that the inversion of blocks 9-16 in C. hsiaowutaishanensis was an independent event. Furthermore, some species from different subgenera have the same relocation or inversion pattern, such as the three Corydalis plants (C. pinnata, C. mucronata, and C. sheareri) collected from Qingcheng Mountain in the current study. Although they represent two subgenera, these three species have an almost identical relocation/inversion pattern in their cp-genomes (Fig. 2). Moreover, blocks 5-7 underwent at least two inversions in C. tomentella; blocks 5-7 were first inverted independently and then inverted together with blocks 3, 4, and 8. This active rearrangement suggests that relocation or inversion in Corydalis cp-genomes might be affected by the geographical environment.
www.nature.com/scientificreports/
Loss of introns and/or genes is instrumental in the regulation of gene expression and can control gene expression temporally and in a tissue-specific manner 61–63. The mechanisms by which introns regulate gene expression in plants and animals have been reported 63–65. However, the link between gene expression and intron loss in Corydalis has not been reported. Further experimental work on the roles of introns in Corydalis is therefore needed.
Highly variable DNA barcodes play an important role in species identification and phylogenetic analyses. In the current study, protein-coding genes (rps19, rpl22, ycf1, and ycf2), intron regions (pafI, ndhA, and rpl2), and intergenic regions (trnQ-UUG-psbK, psbK-psbI, atpF-atpH, atpH-atpI, rpoB-trnC-GCA, trnC-GCA-petN, trnT-GGU-psbD, trnE-UUC-trnT-GGU, trnD-GUC-trnY-GUA, psaA-pafI, pafI-trnS-GGA, rps4-trnT-UGU, trnT-UGU-trnL-UAA, trnR-ACG-trnL-CAA, and trnN-GUU-ndhB) exhibited some variation and have great potential as DNA markers (Fig. 4b).
Cp-genomes have made marked contributions to the phylogenetic studies of angiosperms and to resolving the evolutionary relationships within phylogenetic clades 66,67. However, active rearrangement in Corydalis cp-genomes may mislead the reconstruction of species phylogenetic relationships based on the DNA sequences of cp-genomes. Phylogenetic reconstruction of the genus Corydalis was previously explored with DNA barcoding 68 or relatively conserved nucleotide fragments in cp-genomes 1. However, deep relationships remained poorly resolved by approaches that apply only a few plastid markers. Some studies reported that the protein-coding genes shared by all taxa could be used to reconstruct a phylogeny 2,34. However, single-copy genes (SCGs) have subsequently emerged as candidates for phylogenetic analysis because paralogues are derived from duplication events rather than speciation events and should therefore be discarded from phylogenetic analyses 69,70. Therefore, the 31 single-copy orthologous proteins present in all 27 cp-genomes were used to reconstruct the phylogeny of the genus Corydalis. Three distinct clades were defined by high bootstrap values (Fig. 6) in the resulting phylogenetic tree, which is consistent with previous studies based on molecular markers 1,71. This indicates that the application of the single-copy orthologous proteins of cp-genomes can improve the resolution of the phylogeny and taxonomy of the genus Corydalis. Findings from the study also provide a reference for the taxonomy and identification of other plants with extensive rearrangement in cp-genomes.
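The filtering logic described above, in which gene families with paralogues are discarded and only families with exactly one member per genome are kept, can be sketched as follows. This is an illustrative Python example under stated assumptions: the nested-dictionary layout, the orthogroup labels, and the species labels are hypothetical and do not reproduce the output format of any particular orthology tool.

```python
from typing import Dict, List


def single_copy_orthogroups(
    orthogroups: Dict[str, Dict[str, List[str]]], species: List[str]
) -> List[str]:
    """Return names of orthogroups with exactly one gene in every species.

    `orthogroups` maps an orthogroup name to {species: [gene ids]}.
    Groups with a missing species (zero copies) or a duplicated gene
    (two or more copies, i.e. paralogues) are rejected.
    """
    kept = []
    for name, members in orthogroups.items():
        if all(len(members.get(sp, [])) == 1 for sp in species):
            kept.append(name)
    return sorted(kept)


# Toy input: OG2 has a duplication in sp1, OG3 is absent from sp2.
toy_groups = {
    "OG1": {"sp1": ["geneA1"], "sp2": ["geneA2"]},
    "OG2": {"sp1": ["geneB1", "geneB1_dup"], "sp2": ["geneB2"]},
    "OG3": {"sp1": ["geneC1"]},
}
selected = single_copy_orthogroups(toy_groups, ["sp1", "sp2"])
```

Requiring one copy in every genome is a strict criterion: it trades the number of usable proteins for the guarantee that each column of the final matrix compares orthologous positions.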

Conclusions
The cp-genomes of three species of the genus Corydalis (C. pinnata, C. mucronata, and C. sheareri) from Qingcheng Mountain in southwest China, including a narrow endemic species (C. pinnata), were characterized. The cp-genomes of the three species exhibited large-scale rearrangements, including the relocation of four genes (trnM-CAU-rbcL) in the LSC region, the inversion of an 11–14-kb segment (ndhB-trnR-ACG) in the IR region, and the loss of three genes (accD, clpP, and trnV-UAC). The three Corydalis cp-genomes showed high similarity in terms of genome size, gene classes, gene sequences, rearrangement pattern, and distribution of repeat sequences. In addition, the structural alignment of 17 Corydalis cp-genomes with the typical chloroplast genomic structure of angiosperms (E. pleiosperma) revealed frequent and extensive large-scale rearrangement in the Corydalis cp-genomes. Among these, the relocation of two blocks (trnM-CAU-rbcL and rps16) frequently appeared in the LSC region, and the inversion of four blocks (rpl23-trnL-CAA, ndhB-trnR-ACG, trnN-GUU, and ndhA-ycf1) frequently appeared in the IR and SSC regions. The extensive large-scale cp-genome rearrangement may mislead phylogenetic analysis based on cp-genomes. Single-copy orthologous proteins of cp-genomes were therefore used to reconstruct the phylogeny of the genus Corydalis. This method has good prospects for elucidating the phylogeny and taxonomy of Corydalis and could potentially be employed for the phylogenetic analysis of other lineages with extensively rearranged cp-genomes in future studies. Findings from this study provide a reference for further studies on the taxonomy, identification, and evolution of the genus Corydalis.

Materials and methods
Fig. 1). The specimens were identified by Professor Guihua Jiang. 10 . Total raw data per sample was approximately 10.0 G, and > 300 million paired-end reads were obtained. Raw data were filtered by Skewer-0.2.2 22 72 .
The resulting reads were used for genome assembly with GetOrganelle version 1.7.5 73. A second assembly for each Corydalis species was performed with ABySS, with C. edulis as the reference, to confirm the GetOrganelle assemblies. Clean reads were mapped to the draft genome with BWA version 0.7.17 74, and the mapped reads were then filtered using SAMtools version 1.7 75. Mapping was visualized with IGV version 2.10.0 76 to check the concatenation of contigs 1. Furthermore, junction sites were verified by polymerase chain reaction (PCR) and Sanger sequencing. All of the contigs were aligned to the reference cp-genome of C. edulis with MUMmer version 4.0 77. Finally, the sequences were extended and gaps were filled with SSPACE-3.0 78.

Gene annotation and sequence analyses.
Sequence annotation was performed with Plann version 1.1.2 79 using the cp-genome of C. conspersa as a reference, followed by manual correction. BLAST and Apollo 80 were used to check the start and stop codons and the intron/exon boundaries, again with the cp-genome of C. conspersa as the reference sequence. Complete cp-genome sequences were submitted to the NCBI. A physical map of the cp-genomes was generated with OrganellarGenomeDRAW (OGDRAW) 81 (http://ogdraw.mpimp-golm.mpg.de/).

Genome structure analyses.
To determine synteny and identify possible rearrangements, 19 cp-genomes were compared using Mauve 2.4.0 82 with the "progressiveMauve" algorithm: 17 Corydalis cp-genomes, the cp-genome of Macleaya microcarpa (NC_039623) representing Papaveroideae, and the cp-genome of Euptelea pleiosperma (NC_029429) representing a typical angiosperm cp-genome. The Mauve result was then manually modified to show the notable rearrangements. The cp-genomes of the Corydalis species were compared with mVISTA 83 (Shuffle-LAGAN mode) using the genome of C. edulis as the reference. Tandem Repeats Finder 84 was used to detect tandem repeats, and forward and palindromic repeats were detected with REPuter 85. SSRs were detected by Misa.pl 86 with the following minimum numbers of repeat units: mononucleotides ≥ 10, dinucleotides ≥ 8, trinucleotides and tetranucleotides ≥ 4, and pentanucleotides and hexanucleotides ≥ 3.
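To make the SSR thresholds concrete, the following Python sketch applies the same minimum repeat counts with regular expressions. It is a simplified illustration and not a reimplementation of Misa.pl: the function names are hypothetical, compound SSRs are not handled, and motifs that are themselves repeats of a shorter unit (for example "ATAT") are simply skipped so that one locus is not reported under two unit sizes.

```python
import re
from typing import List, Tuple

# Minimum number of repeat units per motif length, as listed in the text.
MIN_REPEATS = {1: 10, 2: 8, 3: 4, 4: 4, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for each perfect SSR."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                count = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, count))
    return sorted(hits)


# Toy sequence: (A)10, (AT)8 and (AGC)4 separated by single bases.
toy_seq = "G" + "A" * 10 + "C" + "AT" * 8 + "G" + "AGC" * 4 + "T"
ssrs = find_ssrs(toy_seq)
```

On real cp-genome sequences the dedicated tool remains preferable; the sketch only shows how the six thresholds translate into a search rule.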

Phylogenetic analyses.
Twenty-seven cp-genomes were used to reconstruct a phylogenetic tree. First, single-copy orthologous proteins were extracted with OrthoFinder version 2.3.8 87. Next, the sequences were aligned with MUSCLE version 3.8, and the best-fit model of amino acid substitution was estimated with ProtTest version 3.4 88 by selecting the model with the best corrected Akaike Information Criterion (AICc) value. Finally, an ML phylogenetic tree was reconstructed with RAxML version 8.2.12 89 under the HIVb + I + G + F substitution model chosen by ProtTest, and tree robustness was assessed using 1000 rapid bootstrap replicates.
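The AICc criterion used for model selection can be illustrated with a short Python sketch based on the standard small-sample formula AICc = 2k - 2 ln L + 2k(k + 1)/(n - k - 1), where k is the number of free parameters, ln L the log-likelihood, and n the sample size. The function names, the candidate model labels, and all numbers below are hypothetical toy values; how a given program counts parameters or defines the sample size is not assumed here.

```python
from typing import Dict, Tuple


def aicc(log_likelihood: float, n_params: int, sample_size: int) -> float:
    """Corrected Akaike Information Criterion (lower is better)."""
    denom = sample_size - n_params - 1
    if denom <= 0:
        raise ValueError("sample_size must exceed n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    # Small-sample correction term; it vanishes as sample_size grows.
    return aic + (2 * n_params * (n_params + 1)) / denom


def best_model(
    candidates: Dict[str, Tuple[float, int]], sample_size: int
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate with the lowest AICc.

    `candidates` maps a model label to (log-likelihood, free parameters).
    """
    scores = {
        name: aicc(lnl, k, sample_size) for name, (lnl, k) in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Toy comparison of two hypothetical models on 50 alignment columns.
toy_models = {"model_1": (-100.0, 2), "model_2": (-99.5, 5)}
chosen, all_scores = best_model(toy_models, sample_size=50)
```

In the toy comparison the richer model gains only 0.5 log-likelihood units for three extra parameters, so the penalty terms outweigh the gain and the simpler model is preferred.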